Smart Paper Notes

TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

Meryem Banu Cavlak, Gagandeep Singh, Mohammed Alser, Can Firtina, Joël Lindegger, Mohammad Sadrosadati, Nika Mansouri Ghiasi, Can Alkan, Onur Mutlu

Categories: Artificial Intelligence | Machine Learning

2022-12-09

Basecalling is an essential step in nanopore sequencing analysis in which the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., the target reference) and are thus discarded in later steps of the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first fast and widely applicable pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. TargetCall filters out all off-target reads before basecalling, so the highly accurate but slow basecalling is performed only on the raw signals whose noisy reads are labeled as on-target. Our thorough experimental evaluations using both real and simulated data show that TargetCall (1) improves the end-to-end basecalling performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) sensitivity in keeping on-target reads, (2) maintains high accuracy in downstream analysis, (3) precisely filters out up to 94.71% of off-target reads, and (4) achieves better performance, sensitivity, and generality compared to prior works. We freely open-source TargetCall at https://github.com/CMU-SAFARI/TargetCall.
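The two-stage flow described in the abstract (a cheap noisy basecall, a similarity check against the target reference, and accurate basecalling only for reads that pass) can be sketched as follows. This is a minimal illustration, not TargetCall's implementation: the names `kmers`, `is_on_target`, and `filter_then_basecall` are hypothetical, both basecallers are passed in as plain callables, and the k-mer overlap test is a toy stand-in for the paper's Similarity Check.

```python
from typing import Callable, Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_on_target(noisy_read: str, ref_kmers: Set[str], k: int, min_frac: float) -> bool:
    """Toy similarity check (an assumption, not the paper's method):
    label a read on-target if enough of its k-mers occur in the reference."""
    read_kmers = kmers(noisy_read, k)
    if not read_kmers:
        return False
    return len(read_kmers & ref_kmers) / len(read_kmers) >= min_frac


def filter_then_basecall(
    signals: Iterable[str],
    light_call: Callable[[str], str],     # fast, noisy basecaller (hypothetical)
    accurate_call: Callable[[str], str],  # slow, accurate basecaller (hypothetical)
    reference: str,
    k: int = 4,
    min_frac: float = 0.5,
) -> List[str]:
    """Run the expensive basecaller only on signals whose noisy read looks on-target."""
    ref_kmers = kmers(reference, k)
    reads = []
    for signal in signals:
        noisy_read = light_call(signal)
        if is_on_target(noisy_read, ref_kmers, k, min_frac):
            reads.append(accurate_call(signal))
        # Off-target signals are dropped here, before any expensive basecalling.
    return reads


if __name__ == "__main__":
    # For the demo, "signals" are already strings and the basecallers are trivial.
    result = filter_then_basecall(
        ["ACGTACGT", "GGGGGGGG"],
        light_call=lambda s: s,
        accurate_call=lambda s: s.lower(),
        reference="ACGTACGTTGCA",
    )
    print(result)  # ['acgtacgt']
```

The saving comes from the second signal never reaching `accurate_call`; how much is saved in practice depends on the fraction of off-target reads and on how cheap the light basecaller is relative to the accurate one.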


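Profile hidden Markov models (pHMMs) are widely used in many bioinformatics applications to accurately identify similarities between biological sequences (e.g., DNA or protein sequences). pHMMs use a common and highly accurate method, the Baum-Welch algorithm, to compute these similarities. However, the Baum-Welch algorithm is computationally expensive, and existing works provide either software-only or hardware-only solutions for a fixed pHMM design. When we analyze the state-of-the-art works, we find an urgent need for a flexible, high-performance, and energy-efficient hardware-software co-design that effectively addresses all the major inefficiencies of the Baum-Welch algorithm for pHMMs. We propose ApHMM, the first flexible acceleration framework that can significantly reduce the computational and energy overheads of the Baum-Welch algorithm for pHMMs. ApHMM uses hardware-software co-design to address the major inefficiencies in the Baum-Welch algorithm by 1) designing flexible hardware to support different pHMM designs, 2) exploiting predictable data dependency patterns using on-chip memory with memoization techniques, 3) quickly eliminating negligible computations with a hardware-based filter, and 4) minimizing redundant computations. We implement our 1) hardware-software optimizations on specialized hardware and 2) software optimizations for GPUs to provide the first flexible Baum-Welch accelerator for pHMMs. Compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, ApHMM provides significant speedups of 15.55x-260.03x, 1.83x-5.34x, and 27.97x, respectively. ApHMM outperforms the state-of-the-art CPU implementations of three important bioinformatics applications, 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29x-59.94x, 1.03x-1.75x, and 1.03x-1.95x, respectively.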
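For readers unfamiliar with Baum-Welch, one of its building blocks is the forward pass, which computes the likelihood of an observation sequence under a hidden Markov model. The sketch below shows that pass for a small generic HMM in plain Python. It is an illustration under stated assumptions: the function name `forward_likelihood` and the toy parameters are hypothetical, the model is an ordinary HMM rather than a profile HMM, and none of the accelerator's optimizations (memoization, filtering, hardware mapping) are modeled.

```python
from typing import List, Sequence


def forward_likelihood(
    init: Sequence[float],             # init[s]: probability of starting in state s
    trans: Sequence[Sequence[float]],  # trans[p][s]: probability of moving from p to s
    emit: Sequence[Sequence[float]],   # emit[s][o]: probability that state s emits symbol o
    obs: Sequence[int],                # observed symbol indices (non-empty)
) -> float:
    """Forward pass of a discrete HMM: return P(obs | model)."""
    n = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha: List[float] = [init[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


if __name__ == "__main__":
    # Toy two-state, two-symbol model (values chosen only for illustration).
    init = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit = [[0.5, 0.5], [0.1, 0.9]]
    print(forward_likelihood(init, trans, emit, [0, 1]))  # approximately 0.2156
```

Each step costs a sum over all predecessor states for every state, which is why the full algorithm becomes expensive on large models. This unscaled version also underflows on long sequences; practical implementations rescale `alpha` at each step or work in log space.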


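This paper presents OpenPneu, a compact system that supports pneumatic actuation of multiple chambers on soft robots. Micro-pumps are used in the system to generate airflow, so no additional input such as compressed air is needed. The system follows a modular design to provide good scalability, which has been demonstrated on a prototype with ten air channels. Each air channel of OpenPneu is equipped with both inflation and deflation functions, providing a full range of pressure supply from positive to negative with a maximum flow rate of 1.7 L/min. High-precision closed-loop pressure control is built into the system to achieve stable and efficient dynamic performance. An open-source control interface and API in Python are provided. We also demonstrate the functionality of OpenPneu on three soft robotic systems with up to 10 chambers.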
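The abstract does not say which control law is used, so as a generic illustration of closed-loop pressure control, the sketch below drives a simulated chamber toward a setpoint with a proportional-integral (PI) controller. Everything here is an assumption made for the example: `PIController` and `simulate_chamber` are hypothetical names unrelated to the OpenPneu API, the chamber is a unitless first-order model with a leak term, and the gains are arbitrary.

```python
class PIController:
    """Proportional-integral controller (a generic choice, not taken from the paper)."""

    def __init__(self, kp: float, ki: float) -> None:
        self.kp = kp
        self.ki = ki
        self.integral = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        """Return the actuator command for one control step."""
        error = setpoint - measured
        self.integral += error * dt
        return self.kp * error + self.ki * self.integral


def simulate_chamber(setpoint: float, steps: int = 1000, dt: float = 0.01) -> float:
    """Simulate a toy chamber, d(pressure)/dt = command - pressure, under PI control.

    A positive command stands for inflation and a negative one for deflation.
    Returns the pressure after `steps` control steps.
    """
    controller = PIController(kp=2.0, ki=5.0)
    pressure = 0.0
    for _ in range(steps):
        command = controller.update(setpoint, pressure, dt)
        pressure += dt * (command - pressure)  # explicit Euler step of the toy model
    return pressure


if __name__ == "__main__":
    print(round(simulate_chamber(1.0), 3))   # 1.0  (positive pressure target)
    print(round(simulate_chamber(-0.5), 3))  # -0.5 (negative pressure target)
```

The integral term is what removes the steady-state error caused by the leak; a real controller would also need actuator saturation limits and anti-windup, which are omitted here.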
